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Semiparametric forecasting and filtering are introduced as a method of addressing 
model errors arising from unresolved physical phenomena. While traditional 
parametric models are able to learn high-dimensional systems from small data sets, 
their rigid parametric structure makes them vulnerable to model error. On the other 
hand, nonparametric models have a very flexible structure, but they suffer from 
the curse-of-dimensionality and are not practical for high-dimensional systems. The 
semiparametric approach loosens the structure of a parametric model by fitting a data- 
driven nonparametric model for the parameters. Given a parametric dynamical model 
and a noisy data set of historical observations, an adaptive Kalman filter is used to 
extract a time-series of the parameter values. A nonparametric forecasting model for 
the parameters is built by projecting the discrete shift map onto a data-driven basis of 
smooth functions. Existing techniques for filtering and forecasting algorithms extend 
naturally to the semiparametric model which can effectively compensate for model 
error, with forecasting skill approaching that of the perfect model. Semiparametric 
forecasting and filtering are a generalization of statistical semiparametric models to 
time-dependent distributions evolving under dynamical systems. 
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1. Introduction 

The backbone of our modern numerical weather prediction 
(NWP) is the probabilistic point of view that was taken by|Epstein 
|1969| | and |Leith| ( |l974j ) to account for uncertainty in the initial 
conditions. In the operational setting, ensemble forecasting of 
general circulation models (GCMs) has been the principal tool to 
realize the probabilistic forecasting approach at both the NCEP 
and ECMWF since early 1990s ( |Kalnay|2003| >. However, despite 
advances in computational capacity and observation networks, 
the GCMs poorly represent the variability of some physical 
processes, such as tropical convection. A significant challenge 
in representing this process, as pointed out in Fren kel et al. | 
< |2012| ), is the failure of current models to adequately represent the 
interactions between the large-scale atmospheric circulation and 
the convective-scale cloud processes. Such inadequate modeling, 
which is generally known as model error , is the fundamental 
barrier to improve medium-range and long-term forecasting skill 
in NWP. 

Many parametric modeling approaches have been proposed and 
successfully applied to compensate for model error in various 
contexts (for some examples see|Grabowski|200T JE£fa/J2007J 


2004a|b Majda and Grooms||2013[ |Majda and Harlim||2013[ 


Harlim et al. 2014, just to name a few). These approaches typically 


|Frenkel et q/.|20T2{|Majda and Grooms|20l3f Katsoulakis et al. \ 


rely on physical insights into specific problems for choosing the 
appropriate parametric form. In fact, it was rigorously shown 
in a simple context that it is possible to fully compensate for 
model error arising from unresolved scales; in the sense that 
one can simultaneously obtain optimal data assimilation and 
climatological statistical estimates assuming that one has access 
to the right stochastic parametric model ( |Berry and Harlim] 
|2014a|) . While stochastic parameterization techniques have been 
successful for many particular problems, this approach also has 
another limitation beyond the selection of the parametric form. 
Namely, determining the parameters in these models from a 
limited amount of noisy data can be nontrivial, especially when 
the parameters are not directly observed. This was shown in 
Berry and Harlim| ( |2014a[ ), which found that even when the 
appropriate parametric form is known, the choice of the stochastic 
parameter estimation scheme could have a significant effect on 
the results. While some parameter estimation methods have been 
developed which consistently provide accurate data assimilation 
and climatological statistics (for some examples see |Harlim et al. \ 
|2014| |Berry and Sauer||2013[ |Zhen and Harlim||2014| l, these 
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methods are numerically more expensive than the standard linear 
regression fitting method ( [Arnold et q/.|2013| . 

Recently, a nonparametric method for learning a dynamical 


system from a training data set was introduced in Berry and 
|Harlim| (|2014bj) and significantly generalized in Berry et al. 


(2015 I. These methods essentially provide a black-box model for 
forecasting distributions based on the training data. Moreover, it 
was shown in Berry and Harlim (2015a) that the nonparametric 
approach of |Berry et al. ~ ^2015 i provides high forecasting skill, 
and is competitive with some physics-based parametric models 
for predicting energetic modes of turbulence dynamical systems 
using relatively small training data sets (on the order of 10 3 data 
points). The results of Berry and Harlim]( |2015a[ l also demonstrate 
that, combined with the geometric time-delay embedding theory 
of B erry et 577] < |2013] >, the nonparametric method of |Berry et al. \ 
( [2015 I is effective even when the training data set is corrupted 
by noise. Building on these existing results, the nonparametric 
modeling approach developed in |Berry et al. | < [2015( ) will be the 
nonparametric core of the semiparametric modeling paradigm 
introduced in this paper. 

The novel aspect of the present paper, beyond [Berry| 
[et al. ( [2015] ), is to overcome the practical limitation of the 
nonparametric model to low dimensional dynamics. We overcome 
this limitation by assuming that we have an approximate or 
incomplete parametric model, and using nonparametric methods 
to fill in the missing components, in other words, to correct the 
model error. By assuming that the parametric model captures most 
of the variability, so that the model error is low-dimensional, 
the nonparametric model becomes feasible. Simultaneously, 
from the perspective of the modeling community, the goal of 
semiparametric modeling is to overcome model error by using 
historical data to ‘correct’ the existing physical models. These 
two perspectives indicate that semiparametric modeling has the 
potential to seamlessly blend the strengths of the parametric and 
nonparametric modeling approaches, thereby overcoming their 
complementary weaknesses. 

Semiparametric modeling is well known in the statistics 
literature as a flexible work-around for overcoming the curse- 
of-dimensionality l |Hardle|2004j i. Classical nonparametric models 
such as histograms or kernel density estimates can reconstruct 
an unknown time-independent density from a training data set, 
however the amount of data needed grows exponentially in the 
dimensionality of the data. Alternatively, estimating parameters 
for a parametric statistical model can often recover a high¬ 
dimensional density from a relatively small data set. However, 
imposing the rigid structure of the parametric form opens 
up the possibility of model error. Semiparametric models in 
statistics attempt to take an intermediate road by choosing a 
parametric form but allowing the parameters to vary and building 
nonparametric models for the parameters. Choosing the structure 
of the parametric form allows one to encode the prior knowledge 
about the data set, hopefully reducing the intrinsic dimensionality 
of the parameter space to the point where a nonparametric model 
can be used to capture the remaining variability. In this paper, 
we will extend this statistical idea of semiparametric modeling to 
estimating time-dependent densities of dynamical systems. 

Our plan is to demonstrate that one can use a nonparametric 
model constructed using the method of Berry et al. ( |2015| > instead 
of using a problem specific parametric model to compensate for 
the model error. Combining the partial parametric model with 
the nonparametric model requires new semiparametric forecasting 
and filtering algorithms. However, an important consideration 
is that we wish to maintain as much of the current parametric 
ensemble forecasting and filtering framework (as used in NWP) 
as possible. To achieve these goals, in Section[2]we will define the 
form of the model error which we will be able to address here, 


and we briefly review some simple stochastic parametric models 
which have been successful for certain types of model error. We 
will assume that the model error can be described by dynamically 
varying certain parameters in the parametric model, and that the 
evolution of these parameters is independent of the state variables 
of the parametric model. 

In Section [3] we introduce a semiparametric forecasting 
algorithm which combines the nonparametric forecast of |Berry| 
| et al. | ( [2015[ ) with a standard ensemble forecast method for 
the parametric model. For simplicity and clarity, in Section [3] 
we assume that a training data set of the varying parameters 
is available and that we are given noisy initial conditions for 
forecasting. In Sections [4] and [5] we will discuss additional 
strategies for dealing with the more realistic scenario when 
we only have noisy observations of the state variables of the 
parametric model, which is the common situation in practice. 
In particular, in Section [4] we use an adaptive filtering method 
developed in |Berry and S auer (2013]); [Berry and Harlim| ( |2014a[ ) 
to extract a time series of the varying parameters from the 
noisy observations. In Section [5] we introduce a semiparametric 
filter which combines a Ensemble Kalman Filter (EnKF) for 
the parametric model with the nonparametric model (learned 
from the time series recovered in Section [4]) in order to find 
initial conditions from the noisy observations. As a proof-of- 
concept example, we will demonstrate the semiparametric filter 
and forecast on the Lorenz-96 model with model error arising 
from a parameter which evolves according to low-dimensional 
deterministic chaotic or stochastic models. We close this paper 
with a short summary in Section[6] 

2. Problem Statement and Background 

We consider the problem of filtering and forecasting in the 
presence of model error. We assume that we are given a noisy 
time series of observations from a known dynamical model. 


x = f{x,e), (i) 

with observation function, 

y = h(x,6), (2) 

depending on parameters 9 which evolve according to an unknown 
stochastic model, 


9^g(9,W), (3) 

where W denotes a white noise process. We assume that the 
evolution of 6 is ergodic on a physical parameter domain M which 
may be a subset of the parameter space, 9 e M C R™. Moreover, 
we assume that the distribution of 9 evolves according to a Fokker- 
Planck equation with a unique equilibrium density p eq (9). A key 
assumption in this paper is that the evolution of the parameters, 9, 
does not depend on the state x. This says that the model error is 
homogeneous with respect to the state space, which is a restrictive 
assumption for practical applications. The key difficulty in solving 
the conditional or heterogeneous problem is that if g was allowed 
to vary with x there would typically not be enough data at each 
state a; to fit a nonparametric model; we discuss this further in 
Section [6] 

With the above problem statement in mind, we will first 
consider an idealized scenario where we have a training time 
series of the parameters 9. In this first scenario we will also bypass 
the filtering step and assume that we are given an initial state 
(x(t),9(t)) at some time t from which we wish to forecast the 
state into the future. Without knowing the model in 0. our goal 
is to first use the training time series to learn an approximate 
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model g ss g for 9. Subsequently, we combine this approximate 
model g for 9 with the parametric model 0 in order to forecast 
the initial condition (x(t),9(t)). In the remainder of this section 
we will review several standard methods for building approximate 
models, g. 

Given a discrete time series of historical values of the 
parameters {9i = l there are several standard approaches 

for building a forecast model for (x, 9). The simplest method is to 
simply ignore the training data, and hold the value of 9 constant 
at the initial parameter value 9(t), so that g = 0, and we will refer 
to this method as the persistence model. The persistence model 
integrates the system 0 with 9(t + t) = 9(t) held constant at 
the initial value 9(t), in order to obtain the forecast of x. When 
the evolution of 9 is deterministic and very slow relative to x, 
the persistence model can give reasonable short term forecasts, 
assuming that the system is still stable with 9 fixed. Of course, 
if the evolution of 9 is very fast, the persistence model will 
quickly diverge from the true state. Moreover, the persistence 
model will typically not capture the correct long term mean of 
the true system, leading to a very biased long term forecast which 
can easily be worse than simply predicting the mean value of x. 

When the evolution of 9 is very fast, we can assume that, 
relative to x, the parameters 9 quickly equilibrate to their 
climatological distribution. Assuming that the evolution of 9 is 
ergodic and that we have a sufficiently long training data set 
{9i}, we can emulate the equilibrium density by simply drawing 
random parameters from the historical data. To use this in the 
forecast, for each integration step of the forecast model 0 
we draw a random value of 9 from the historical data; this 
strategy is similar to the heterogeneous multiscale method (HMM) 
framework ( E et «/. |2007| ). For the long term forecast, the HMM 
gives an unbiased forecast model which recovers the correct 
equilibrium density, however for the short term forecast, the HMM 
completely ignores the known initial parameter value 9(t). 

So far, the persistence model assumes the evolution of the 
parameters is slow and ignores the training data, whereas the 
HMM assumes the evolution of the parameters is fast and 
ignores the initial state 9(t). It turns out that blending these 
two approaches is quite challenging. A method which attempts 
to blend these, in the sense that it moves the initial state 9{t) 
toward the long term mean 9 = ^ yT = , at an appropriate 
speed, was introduced as part of the MTV strategy for reduced 
climate modeling (Majd a et al. | 1 999[ >. In |Majda et al. |(T9 99) they 
approximate the interaction between the unresolved scales with 
a parametric model involving a linear damping term and white 
noise stochastic forcing. In our context, since g is assumed to be 
independent of x, we simply have a linear stochastic model, 

9 = g(9, W) = a(9 -9) + oW, (4) 

where a = 1/Tg is determined by the correlation time Tg of 
the historical data and the stochastic forcing a = ^/2avar(0) is 
determined by the variance of the historical data. This approach 
is known as the mean stochastic model (MSM) (see Majda et al. 
[2010| . Assuming that the initial density for the parameters is 
Gaussian with mean given by the initial state 9(t) and an initial 
variance s(t ) (possibly coming from a Kalman filter), the density 
of the MSM forecast at time t + r is a Gaussian and its mean and 
variance can be determined explicitly. Using the analytic solution 
for the linear (and Gaussian) model in 0 - we can generate 
random samples from the Gaussian forecast at each integration 
step of the ensemble forecast. While the MSM effectively uses 
the initial state, it only captures two moments (namely the mean 
and covariance) and the two-time correlation statistics of the 
climatological distribution. 

In our numerical experiments below, we will compare the 
forecasting skill of the semiparametric model with the three 


approaches mentioned above, namely the persistence, HMM, and 
MSM models, in addition to the perfect model. There are of 
course other more sophisticated parametric stochastic models, 
and we should point out that a comparison with these methods 
was already made in the context of nonparametric forecasting 
of turbulent modes in |Berry and Harlim| |20 15a). In this paper 
we assume that all of the physical knowledge has already been 
incorporated into the parametric model, so that we have absolutely 
no knowledge of the system 0 which could be used to suggest a 
better parametric form. 

3. Semiparametric Forecasting 

We will assume that the model for x is nonlinear and potentially 
high-dimensional and chaotic so that the most practical method of 
forecasting x is simply to integrate an ensemble of initial states 
forward in time using the model 0. Semiparametric forecasting 
consists of using the training data to learn a nonparametric model 
for 9 and combining this model with the dynamical model 0 to 
improve the forecast of an initial state (x(t), 9{t)). 

To simplify the discussion in this section, we assume that we 
have a training data set consisting of a historical time series 
{0i = 9(ti)} of the parameters and we are given an initial state 
x(t) and the initial parameter values 9(t ) at some time t, from 
which we would like to forecast the state x into the future. We 
will consider the more realistic situation, where the training data 
set and initial conditions for 9 is not available, in Sections [4] and 
[5] In the remainder of this section, under the current idealized 
scenario, we introduce the semiparametric forecasting approach. 
In Section |3.1| we briefly review the nonparametric model of 
(Berry et q /.|2015| for 9. In Section [.L2| we show how to combine 
this nonparametric model with the known parametric model, /, 
to perform a semiparametric forecast. Finally, in Section |3.3| we 
demonstrate the semiparametric forecast on an example based on 
a Lorenz-96 model, where we introduce a complex model error 
which is governed by the Lorenz-63 model. 

3.1. Step 1: Forecasting parameters with the nonparametric 
model 


We assume that the true dynamics of the parameters 9{t ) are 
constrained to a physical parameter domain M C R™, and evolve 
according to a stochastically forced dynamical system 0 - Notice 
that we allow the data to be constrained to a domain M inside 
the parameter space R n , so even when the dimension, n, of the 
parameter space is high, the data requirements will only depend on 
the dimension of A4. We emphasize that the domain M , as well 
as g, are unknown and will be learned implicitly from the training 
data set. Moreover, the parameters 9 may include variables which 
do not appear explicitly in the model 0, but we assume that 
our training data set only consists of observable variables which 
appear explicitly in 0. In order to compensate for unobserved 
variables in the dynamics of the parameters 9 which appear in 0. 
we will apply a time-delay embedding to the available variables 


(Sauer et al. 

1991a 

Stark 

1999 

Stark et a/.||2003[ Berry et al. 

2013 1 

Gianna 

ds anc 

Majda 

2011 

2012; Berry and Harlim 2015a). 


we are interested in constructing a forecasting model so that 
given an initial density p(9, t ) at time t we can estimate the 
density p(9, t + t) at time t + r, where t > 0. We assume that 
9i is already the delay coordinate embedding of the available 
parameters which appear explicitly in the parametric model 0. 
Probabilistically, the evolution of the density, p, in the forecasting 
problem is governed by the Fokker-Planck equation. 


dp _ r * 
dt 


P, p{0,t)=Pt(0), 


( 5 ) 
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where C* is the Fokker-Planck operator, which we assume exists 
for the system 0- We note that the equilibrium distribution, 
Peq(9), of the underlying dynamics 0 satisfies, C*p e q = 0. 

The nonparametric forecasting algorithm introduced in [Berry| 
| et al. | \2() 15j i is called the diffusion forecast. The central idea of 
the diffusion forecast is to project the forecasting problem 0 
onto a basis of smooth real-valued functions {i fij{9)} defined 
on the parameter domain A4. We will discuss below how the 
basis {i pj} will be estimated using the training data; for now 
we assume that the basis functions are known. In order to find 
the forecast density p(9, t + r) from the initial density p(9 , t), we 
first project this function onto the basis {pj} by computing the 
projection coefficients Cj{t) = (p(-,t),ipj). We can then forecast 
the coefficients c(t) with a linear map c(t + r) = Ac(t), whose 
components are defined by Aij = E [{pj, SVi)p oq ], where S is 
the discrete shift map defined by Spi{9f) = pi{0 i+ 1 ). It was 
shown in ( (Berry et a/.||2015| that the linear map A is exactly 
the forecasting operator (which takes the density p{9 , t) to the 
forecast p{9, t + r)) represented in the basis {pj}. The time step, 
t, of the forecasting operator is determined by the sampling time 
T = t i+ 1 - ti of the training data. Finally, we can reconstruct the 
forecast density by, 

p(9,t + r) = y^Cjjt + T)p eq (9)pj{9). 

i 

These formula were all derived in ( |Berry et fl/.||2015~] >, and we 
now summarize the nonparametric forecasting approach with the 
following diagram, 

,. . Nonparametric Forecast . _ . 

p(9,t) - 1 -► p(0,t + T ) 




c j ( PjPe q 


m 


A lj=V\.{Vj,S V i) Pe ^\ 

- > c(t + T) 


Ac(t). 


In the above diagram, the nonparametric forecasting algorithm 
follows the solid arrows: projecting onto the basis, applying 
the linear map A (which computes the forecast), and then 
reconstructing the forecast density. 

Given a basis {pj} and an estimate of the equilibrium density 
p e q , it was shown in (Berry et fl/.] |2015| > that the above inner 
products can be estimated as Monte-Carlo integrals on the training 
data set {9i} given by. 


Cj(t) = (p{-,t),pj) « ^^ J p{9 i ,t)pj{9 i )/p eq {9f) (6) 

i =1 

1 iV " 1 

Aj = (‘Pj’ S w) Peq ~ JfZTJ Vj{^i)Vl{9 i+ 1). (7) 

i= 1 

Notice that we only need estimates of the basis {pj} and the 
density p e q evaluated on the training data set. Of course, in 
practice we will truncate the projections onto a finite number M 
of the basis functions, pj, so that we implicitly project the forecast 
density into the span of the first M basis functions, {pj}p =1 . 
Notice that by projecting and reconstructing the density p in the 
basis {pj} (see the diagram above), we have projected the density 
in the ambient space onto the physical domain M, as determined 
by the historical data set. In fact, {pj} is a generalized Fourier 
basis and since we only use M basis elements, it is possible to 
have Gibbs phenomenon oscillations in the reconstructed density, 
which can lead to negative values. In order to avoid these negative 
values, whenever we reconstruct the density we will always take 
the max of the reconstructed values and zero, and then renormalize 


the density by dividing by. 


L 


N 


Z= I p(9,t ): 

leeM 


■^J^p(9i,t)/Peq(9i) 


( 8 ) 


i=i 


which estimates the normalization factor. 

The final piece of the nonparametric forecasting algorithm is 


choosing an appropriate basis {pj}. It was shown in Berry et al. 


( |2015| > that the optimal basis for minimizing the error between 
the estimate A of the matrix A are the eigenfunctions of an 
elliptic operator £. The operator £ is the generator of a stochastic 
gradient flow in the potential field U{9) = — logp e q(0), where 
Peq is the equilibrium density of 0- By choosing this basis 
of eigenfunctions, it was shown in [Berry et~. ( 2015) that the 
difference between the estimate Aij and the true coefficient A/j 
is an error of order r/N in probability. Most importantly, these 
eigenfunctions, {pj}, and the density p e q, can be estimated with 
the diffusion maps algorithm ( [Coifman and Lafon|2006||Berry and| 
|Harlim|2015b| l. 

Intuitively, the diffusion maps algorithm uses the data set 
{0i}£Li, to approximate the operator £. For any function F : 
M — ¥ R we can represent F on the data set {9^} by an N x 1 
vector Fj = F{9j). The operator £ is then approximated by an 
N x N matrix L, in the sense that the matrix-vector product LF 
has entries ( LF)i « £(F)(0,) so that the vector LF approximates 
the function CF on the data set. The algorithm for constructing 
the matrix L, and the theory proving its convergence to £ in 
the limit of large data, were established in Berry and Harlim 
(|2015b|> for sparsely sampled data; generalizing the seminal result 


of Coifman and Lafon] ([2006 ). The eigenvectors of the matrix L 
approximate the desired eigenfunctions evaluated on the training 
data set, as required in l(6]» and 0. The diffusion maps algorithm 
simultaneously provides a kernel density estimate of p e q evaluated 
on the training data set as required in <[6]l and 0- As a result, 
the reconstructed forecast density, p{9, t + r), is represented by 
its values on the training data set, p(9i,t + r). In this article, 
we use the algorithm of |Berry and Harlim| ( |2015b| > since the 
equilibrium density, p eq , may take values arbitrarily close to zero, 
which would imply regions of sparse sampling on M. For a brief 
overview of the theory of Berry and Harlim l |2015bj l, see the 
Supplementary Material of Berry et al. (2015), and for a simple 
guide to the algorithm see the appendix of Berry and Harlim| 
l |20T5a| ). 


3.2. Step 2: Combining the nonparametric and parametric 
forecast models 

In this section, we introduce the semiparametric forecasting 
scheme which combines the parametric model 0 for x and the 
nonparametric model of Section 0 for 9. We assume that we 
are given an ensemble of K initial conditions {x k {t)}]^ = i which 
represent the distribution of the variable x at the initial time t, 
and an initial density p{9, t) for the parameters, and our goal is to 
forecast these two quantities to the future time t + r. 

We first describe how to forecast the ensemble {* fc }. Since 
the model for x is parametric of the form /( x, 9), the evolution 
of x depends on the parameters 9. In order to forecast each 
ensemble member, x k , with the parametric model, we need to 
produce samples {9 k }^ = 0 of the density p(9,t). We will sample 
the density p{9, t) using the standard rejection method, where the 
reference density is the equilibrium density p e q , which is also the 
sampling measure by our ergodicity assumption on 9. To perform 
the rejection sampling, we set P = nmxi{p(9i,t)/p e q(9i)} to be 
the upper bound of the ratio between the two distributions. Notice 
that since both densities are known on the historical training 
data set {0, } we can easily compute this optimal upper bound. 
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Since the historical training data points {#,} are samples of p eq , 
we can easily produce a random sample of the density p eq by 
randomly choosing a training data point 9 m where m £ {1, N} 
is a uniformly random index. We then choose a uniform random 
number £ £ [0,1] and we accept 9 m as a sample of p(9, t ) if 
£ < p(9 m ,t) / (Pp e q(9m)) and otherwise we reject 9 m and redraw 
until we accept. We repeat this rejection sampling until we have 
K samples {9 k (t)}^ =1 from the density p(9,t). Each ensemble 
member (a; fc (i)} is then paired with the corresponding sample 
from p(9,t) to form the combined ensemble {(x k (t), 9 k (t))}. 
We can now integrate the parametric model x = f(x,9), using 
each ensemble state (x k (t),9 k (t)), for time r in order to find the 
forecast ensemble { x k (t + t)} at time t + r. 

The second part of the semiparametric forecasting algorithm 
is exactly the diffusion forecasting scheme described in < [3.1[ >. 
To repeat, we project p(9,t) onto the basis {tpj} to find the 
coefficients Cj(t) as in 0- We then forecast the density on 
the parameters using the matrix A from 0 so that c(t + t) = 
Ac(t) and we can then reconstruct the density at time t + r. 
Subsequently, we sample p(9, t + r) using the same rejection 
sampling procedure to obtain an ensemble {^(t + r)} which 
we then pair with {x k (t + r)} to form {(x k (t + t) , 9 k (t + 
t))} and again integrate to find {x k (t + 2r)}. We can then 
repeat the procedure by finding c(f + 2 r) = Ac(t + r) = A 2 c(t), 
reconstructing the density p(9, t + 2r) and so on for as long 
as we would like to forecast. Therefore, in addition to the 
nonparametric forecasting described before, the semiparametric 
forecasting algorithm consists of sampling the nonparametric 
forecast density at each forecast step and using these samples for 
integrating an ensemble forecast of the parametric model. 

We summarize the semiparametric forecasting algorithm with 
the following diagram, 

(x k (t),9 k (t)) --> (x k (t + r),9 k (t + T)) 


6 k (t) 


e k (t+r) 


p{9,t) 


Nonparametric Forecast 


p(9,t + r) 




jtpjPe, 


c(t) 


Aij — ,S<pi )p 


-¥ c(t + t) = Ac(t). 


Notice that the nonparametric model iterates independently from 
the parametric model, which follows from the assumption that the 
model 0 for 9 is independent of the state variable x. Furthermore, 
the dashed line is not taken directly but is a summary of the path 
of solid lines which uses the projection onto the basis {cpj}. The 
parametric model is used to forecast the ensemble of states x k (t) 
and at each step the ensemble of parameters 9 k (t) are updated by 
sampling the density p(9, t) from the nonparametric forecast. As 


in Section 3.1 the initial density p(9,t), which can be arbitrary, 


will be evaluated on the data set, and all subsequent densities 
p(9 , t + tot) are represented by the values p(9i,t + tot). 

We should note that the ensemble members 9 k (t + mr) are not 
actually time series solutions of 0. since at each forecast step we 
choose a completely new set of random samples independently 
from the samples of the previous step. Producing a consistent 
time series for each ensemble forecast 9 k (t + tot) would require 
sampling the conditional density at each time given the sample 
at the previous time. This would allow more complex integration 
schemes to be considered for combining the samples {6 k } with 
the ensemble {* fc }, however, for a small ensemble size it would 
result in a more biased forecast. 


3.3. Application of semiparametric modeling to a chaotic 
system with model error 

In this example we will introduce model error into the 40- 
dimensional Lorenz-96 < jLorcnz| 1996) system, 

±i = Xi-ix i+ i - Xi-iXi-2 - Xi + 8, (9) 

where the indices are taken modulo 40, which is known to have 
chaotic dynamics. Notice that the system is dissipative in the sense 
that the evolution of the energy-like quantity, S = x 2 /40 is 

simply S = —2 S + 2/5 JT a^. We will now introduce a model 
error by assuming that the evolution of the true system is governed 

by, 


ii = f{xi,9) = 6xi-ix i+ i - Xi-iXi-2 -Xi + 8 
9 = (ai/40 + 1) 

ai = 10 (a 2 — ai)/e, (10) 

a 2 = (28ai — o 2 — aia 3 )/e, 
a 3 = (aia 2 - 8a 3 /3)/e, 

which we call the L96-L63 system, where we replace the constant 
coefficient in the first quadratic term of 0 with a parameter 9. The 
parameter 9 is an appropriate rescaling of the first component, ai, 
of the chaotic three-dimensional Lorenz-63 l |Lorenz|1963[ l system. 
The system © does not have the dissipative property of 0 but is 
empirically stable in long numerical simulations. Empirically, we 
found that the system ( | 10| ) became very unstable if 9 was allowed 
outside the interval [0.5,1.5]. The constant e is included to allow 
differences in the time-scale between the parametric model for x 
and the parameters 9. In this section we will use t = 1 however 
we will consider faster (e < 1) and slower (e > 1) time-scales in 
subsequent sections. In Figure [T] we compare the dynamics of 
these two systems. Notice that the spatio-temporal patterns of 
these two very different dynamical systems appear deceptively 
similar. 

While this example is a proof of concept for semiparametric 
forecasting where we assume we are given a training time series, 
in real applications we can at most extract the time series of 9 
which appears explicitly in the parametric model. In Section [4] 
we describe a method of extracting a time series of 9 from noisy 
observations, and we will see that we cannot hope to recover the 
three hidden variables {ai, a 2 , a 3 }. Since the hidden variables do 
not appear explicitly in the parametric model / in ( | 1 Q| ), we will 
train the nonparametric model using only a time series of 5000 
values of 9i = 9{ti) with spacing Af = ti +1 — L; = 0.1. In order 
to account for the hidden variables in the nonparametric model, 
we perform a time delay embedding using 4-lags, meaning that 
we replace the one-dimensional time series 9, with the time-delay 
embedding coordinates, (0j, 0j_i,..., 9 i _f) Y , prior to building the 
nonparametric model. 

In order to compare the forecasting skill of various methods 
with imperfect initial conditions, we introduce a Gaussian 
perturbation to the true state, 

(x (t), 9(f)) = (x(t), 9(f)) + 7? t , Vt ~ Af (0, C(t)), 


where the perturbation r/t has mean zero and variance equal to 
0.1% of the long term variance of each variable. In other words, 
we define the covariance matrix. 


C(f) = 


C X x(f) 67 x #(t) \ 
C 9x (t) C ee (f) )' 


(ID 


where C X x(t) is diagonal matrix of size 40 x 40 and Cgg(t) is 
lxl for this example; and the diagonal entries of these sub¬ 
matrices are equal to 0.1% of the variance of the respective 
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Figure 1 . Comparison of the Lorenz-96 (top) model dynamics with the modified system {D (middle) and the coefficient (0/40 + 1) (bottom). 


variables in ( |10[ >. In this numerical experiment, we assume that 
the cross covariances C x g(t ) = Cg x (t) = 0, and moreover, notice 
that the covariance matrix C(t) is actually the same for every 
initial time, t, in this experiment. In Section[5] the semiparametric 
filter solutions will produce time varying covariances and cross¬ 
covariance which are nonzero. Each forecast method will be given 
the perturbed initial state (x(t),9(t)) and the covariance matrix 
C(t), and each method will attempt to forecast the true state 
x(t + mr) for m = 0,1,..., 50 time steps. 

Typically we are interested in problems where only a small 
number K of ensemble members, {x k (t)}^ =1 , can be integrated; 
and a good choice are the sigma points of the covariance ellipse 
( [Julier and Uhlmann 20041. The sigma points are given by the 
adding and subtracting the columns of the symmetric positive 
definite square root of the covariance matrix G xx (t ), to the initial 
state estimate x(t). For the persistence model, we simply fix 
6 = 9(t) and apply the ensemble forecast to the parametric model 
with this fixed value of 9. For the MSM parameterization, we 
used the analytic solutions to the Ornstein-Uhlenbeck model (fit 
to the training data {9{\) to forecast the initial mean 1 9(f) and 
covariance Cgg (f) and we used the forecast mean at each discrete 
time step as the fixed parameter 9 for the ensemble forecast of 
the parametric model. We also tried using the MSM forecast of 
the covariance to sample an ensemble of parameters, {# fc }, from 
the Gaussian forecast density, however we found that this was 
unstable for this problem. For the HMM parameterization, we 
immediately ignore 9(t) by drawing samples of initial conditions 
{9 k (t)}^ =1 from the equilibrium density, which means that each 
ensemble member 9 k is chosen by selecting a random index 
m € {1, and setting 9 k = 9 m from the training data set. 
The ensemble of {9 k } is combined with the ensemble {+} 
and the parametric model is used to forecast one step and then 
the parameter ensemble {# fc } is resampled and the process is 
repeated. For the perfect model, we generate an ensemble of 
{(x k (t),a k (t),a 2 (t),a k (t)}^ =1 using the sigma points of a 43 x 
43 diagonal covariance matrix with diagonal entries equal to 0.1% 
of the long term variance of each variable and then perform 
the ensemble forecast using the full model ( |10[ ). Note that the 
initial state for the full model is given by ai(t) = 40 {9(t) — 1) 
and (a 2 (t), a 3 (t)) = (a 2 (t), a 3 (t)) + (r/ 2 (t), r/ 3 (t)) where 
are Gaussian with mean zero and variance equal to 0.1% of the 


variance of the respective variables. To initiate the nonparametric 
forecast in the semiparametric forecasting algorithm, we form 
an initial density for the parameters using a Gaussian p(9i,t) = 
exp (— \\0(t) — 9i\\c ge ) which we normalize by dividing p(9i,t) 
by the normalization factor (8}. We then use the rejection method 
to sample {9 k (t)}if =1 and perform the semiparametric forecast 
iteratively as described in Section [372] 

The above methods are applied to 1000 initial conditions, 
forecasting for a total of 50 time steps of length r = 0.1, which 
corresponds to 5 model time units, or equivalently 25 model 
days (using the convention of 1 model day being 0.2 model time 
units of l jLorenzfl963| ). The results are shown in Figure [2] with 
the root mean squared error (RMSE) averaged over the 1000 
initial conditions. With only 86 ensemble members, the ensemble 
forecast using the perfect model produces the best short 
term prediction, but it also seems to produce a biased forecast. 
This bias is shown by the RMS error rising above that of the 
climatological error in the intermediate term beyond 12 model 
days (which corresponds to 24 forecast steps, or 2.4 model time 
units). In contrast, the RMS error of the semiparametric forecast 
grows slightly more quickly than the perfect model initially, but 
the forecast is unbiased in the intermediate term, approaching the 
climatology without exceeding the climatological error. In fact, 
after 8 model days the semiparametric forecast is producing a 
better forecast than the perfect model. This is due to the samples 
of the nonparametric forecast being independent samples of the 
forecast density, rather than being constrained to follow the paths 
of the initial ensemble members. Notice that the HMM, which 
immediately samples the equilibrium density p e q by drawing 
random samples from the training data, initially matches the 
forecasting skill of the unmodified Lorenz-96 model of 0, and 
in the long term the HMM is unbiased whereas the unmodified 
model is very biased. Finally, the persistence model and the MSM 
use the initial condition 9(t ) to improve the very short term 
forecast, but in the long term these methods do not capture the 
true climatology of 9, leading to a biased intermediate to long 
term forecast. In fact, the persistence model is so biased that the 
forecast model diverges catastrophically; this is because the model 
in m does not conserve energy when 9 is held fixed at a value 
other than one. By varying 9 in time, with a mean value of 1, 
all the models except for the persistence model are able to give 
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a stable forecast. In the short term, the semiparametric forecast 
is able to provide an additional 1-2 model days of equivalent 
forecast skill compared to the standard methods considered here, 
and almost matches the perfect model forecast (see the 7 day 
lead-time forecast in the bottom panel of Figure [2] during the 
verification period 770-880 time steps). In the intermediate term 
forecast, the lack of bias in the semiparametric forecast provides 
a significant improvement in forecast skill, even compared to the 
perfect model for this small ensemble size. 




problem, but rather to present a proof-of-concept approach which 
shows that extracting a time series of 9 is possible in a reasonably 
difficult context, and that the imperfectly recovered time series of 
9 can still be used to improve forecasting via a semi-parametric 
model. 

The approach we take here follows closely the standard state 
augmentation approach for parameter estimation. We will recover 
the time series of 6 as part of an ensemble Kalman filter applied 
to recover the state variable x from the noisy observations y. This 
approach requires augmenting 0 with a model for 9. If 9 were 
simply a constant parameter we could use the persistence model, 
9 = 0, and this corresponds to the standard state augmentation 
approach to parameter estimation jFriedlandj 11969[ >. Instead, 
because 9 is changing in time, we will assume a white noise 
model, 9 = \fQggW. A typical approach is to empirically tune 
the value of Qgg ( |Friedland||1982| > possibly using some a priori 
knowledge of the variable 9. Intuitively, the true signal 9 is being 
treated as a realization of a white noise process, and so a good 
choice for Qgg would be the variance of the time series {#;}. Of 
course, since recovering the time series 9, is our goal, we do not 
assume that the variance is known, and instead we use a method 
of adaptive filtering which can automatically recover Qgg. The 


method used here to find Qgg is detailec 

in Appendix [A] and is 

a version of the approach introduced in ( 

Berry and Sauer 2013) 

and applied in (Berry and Harlim 2014a 

), which is a nonlinear 

generalization of an earlier approach in 

Mehra 1972). We note 

that a closely related technique was simultaneously developed in 
(Harlim et a/.|[2014) and is refined in (Zhen and Harlim, 2014). 


Intuitively, these methods use the error between the predicted and 
actual observations (which is also known as the innovation), to 
determine the amount of noise which must have been added to 
the state in order to explain this error. By analyzing correlations 
between the innovations at different lags it is possible to find the 
observation covariance R as well as observable parameters of the 
dynamical noise, such as Qgg. 

Once the value of Qgg is found from the adaptive EnKF, we use 
a standard EnKF to solve the following filtering problem. 


x = f(x, 9) 

6 = VQeeW ( 12 ) 

y° = y + n = K x ) + 7, 


Figure 2. Comparison of forecasting methods for the model m with e = 1, given 
perfect training data and perturbed initial conditions for the parameter, 6. The top 
panel shows the forecast RMS errors as functions of forecast lead time and the 
bottom panel shows the 7 day lead-time forecast during the verification period of 
770-880 time steps. 


4. Extracting a time series of hidden parameters 

In this section, we will consider the more practical situation in 
which the historical time series of the parameters {9i} is not 
available. We will show that it is possible to extract such a training 
data set from noisy observations y° = y + y = h(x) + y where 
y ~ J\f(0 , R) is a Gaussian observation noise. Of course, this 
recovered data set will not be perfect and it will not exactly obey 
the dynamics of 9 in < [ 10[ >. In Section [5] we will show that the 
recovered data set still allows for training a useful nonparametric 
model which improves the forecast. If 9 were fixed, or we had a 
parametric form for the dynamics of 9, then extracting the time 
series of 9 from the noisy observations y° would be a standard 
inverse problem. However, we would like to recover a time series 
of values of 9 without knowing anything about the evolution of 
9. This type of inverse problem should require at a minimum the 
same observability conditions as that of finding a fixed parameter 
9. Our goal here is not to give a formal exposition of this inverse 


where the goal is to recover the optimal estimates of (x, 9) 
at each time ti, given all the observations y°{tj) for j < i. 
Note that in the examples in this paper there is no stochastic 
forcing on the x variables, however this approach would allow 
an additive white noise forcing term of the form x = f(x,9) + 
s/QxxWx- Furthermore, we will only consider examples where 
the observation function h has no explicit dependence on 9 since 
having direct observations of 9 would typically make recovering 9 
less challenging. In Figure [3] we show the results of applying this 
algorithm for the model l| 1 Ojusing an unscented ensemble Kalman 
filter ( jjulier and Uhlmann|2004| ) with 82 ensemble members (the 
filter equations used here are the same as those in Section[5]for the 
parametric model). 

In this example, we use a time series of 5000 noisy observations 
y°(ti ) with At = — ti = 0.1, the observation function is 

h(x, 9) = x, and the observation noise variance is R = 5/40- In 
Figure [3] we show the true time series of 9 and the recovered 
time series for e = 0.25,1,4. In each case, the the true variance 
of 9 is 0.039, and the estimated variances are 0.0192, 0.0193, and 
0.0066 for e = 0.25,1,4 respectively. When the Lorenz-63 system 
evolves more slowly, as in the cases e = 1,4, a very accurate 
recovery is possible. In contrast, when e = 0.25, the Lorenz-63 
system evolves quickly relative to the observations y of the state x, 
which makes clean recovery challenging, however the recovered 


(c) 0000 Royal Meteorological Society 


Prepared using qjrms4.cls 



































































8 


T. Berry and J. Harlim 



0 . 6 - 


°' 5 0 50 100 150 200 250 

Time (time steps, A t = 0.1) 

1.5,-,-,-,-,- 



0 ' 5 0 50 100 150 200 250 


Time (time steps, At = 0.1) 



Figure 3. Comparison the true time series of 6 and the recovered time series for 
e — 0.25 (left), e = 1 (middle), and e = 4 (right). 


time series still has a high correlation with the true dynamics. 
Obviously, one would get worse recovery when observation 
function h is a more complicated operator (for example, if the 
observations of the state x are sparse) or when the observation 
noise covariance R is larger. While these issues can be important 
in practice, we omit exploring them here since it is more related 
to the accuracy of the primary filter (EnKF) and our assumption 
is that 9 can be recovered with an appropriate ensemble filtering 
method from the existing ensemble forecasting infrastructure. 

In Section[5]we will compare the filtering and forecasting skill 
of the semiparameteric approach to the various parametric models 
introduced in Section[2]for these three time scales (e = 0.25,1,4). 
As we will see, even when the parameters evolve quickly the 
recovered data set is still useful in overcoming the model error. 


5. Semiparametric filtering 

In this section we assume that a training data set has already 
been recovered using the method of Section [4] and our goal 
is to estimate the current state, x(t), and the posterior density 
p(9, t) at some time t given a sequence of noisy observations 
y°{s) for s < t, which will serve as the initial conditions for the 
forecasting problem of Section [5] Recall that the nonparametric 
forecasting procedure involves writing a density in the basis {ytj} 
and propagating the coefficients Cj forward in time with a linear 
map A, where A and ipj are built using the time series extracted 
by the method in Section [4] Therefore, we are actually interested 
in the coefficients Cj(t) which describe the conditional density 
p{9(t) | y°(s), s < t ) written in the basis ipj. In other words, given 
all of the observations up to and including the current observation, 
we would like to find the optimal description of the current state. 
To solve this problem, we need to combine our nonparametric 
forecast model with the parametric filter to form a semiparametric 
filter. 

The goal of semiparametric filtering is to combine the 
parametric model, /, for the evolution of x with a nonparametric 
model for the evolution of 9 to determine the filtered state 
p(x(t), 9(t) | y°{s), s <t) from the previous filtered state p(x(t — 
r),9(t — r)\y°(s),s < t — t). The algorithm introduced here 
will follow the standard forecast-assimilate paradigm for filtering, 
in which the forecasting step will be to perform a one step 
forecast to determine p(x(t), 8(t) \ y°(s), s < t — r) and the 
assimilation step uses the observation y°(t) to find the final 
density p(x(t), 0(t) \ y°(s), s < t ). 

We introduce the notation x k,a and 9 k,a to denote an ensemble 
of states x and parameters 9. The superscript k denotes the k- 
th ensemble member, and the superscript ‘a’ stands for analysis, 
indicating that this ensemble has assimilated all the information 
up to the current time, including the current observation. 
Similarly, x k ^ and 9 k ’f denote the forecast (also known as 
‘prior’) ensemble, as indicated by the superscript ‘f’, which 
account for the observations up to the previous time step. To be 
consistent, we also use same superscripts ‘a’ and ‘f’ to denote 
the nonparametric analysis density p a {9 , t) and the nonparametric 
forecast density pf ( 9 , t ) as well as the associated coefficients c a 
and c f respectively. 

We overview the semiparametric filtering algorithm with the 
diagram shown in Figure [4] As shown in Figure [4] information 
will flow up from the density to the ensemble in the forecast 
step and back down from the ensemble to the density in the 
assimilation step. The basic idea of the semiparametric filter is 
to apply the EnKF to estimate a Gaussian approximation to the 
posterior p(x, 9(t) \ y°{s), s <t) and then to treat the parametric 
filter estimate 9 a (t) as a noisy observation which we feed into a 
secondary nonparametric filter. The forecast step for 9 can then 
be performed using the nonparametric model, and the first two 
statistics of this forecast are substituted into the Kalman filter prior 
statistics for 9. 

To make this procedure precise, assume that we are given 
an estimate of the state (x a (t — r), 9 a (t — r)) along with a 
covariance matrix C a (t — r) of the form CD and coefficients 
c a (t — t) which represent the current posterior density p a {9 , t — 
t) given all the observations y°(s) for s <t — r. The filtering 
procedure is iterative, so given this information, it will assimilate 
the next observation y°(t) and produce ( x a (t ), 9 a (t)), C a (t ) and 

c°(f). 

To perform the forecast step of the semiparametric filter, we 
first form an ensemble {(x k ’ a (t — r), 9 k,a {t — t))}^ =1 of states 
which match the statistics ( x a (t — r),9 a (t — t)) and C a (t — r). 
In all the examples in this paper we use the unscented square 
root ensemble (jjulier and Uhlmann|2004[> as described in Section 
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o x k ’ a (t-T),o k ’ a (t-T )) - EnKFassimilalesy ° (t) > (x k ’ a (t),e k ’ a (t)) 


P a (e,t-r) 

C a (t~T) 


Nonparametric Forecast 


9 k ’ f (t ) 


p f (9,t) 


E j ct <PjPe q 


P (e“(t)|9(t)) 


p f ( 6 )p°-( 6 a \ 6 ) 


-+ P°(f,t) 


(p a ,Vj) 


A lj= n^,S Vl)p ^] ^ g (t) = A _, Q (t _ T) _ _ BayesianUpdate _ ^ ^ _ 


Figure 4. Diagram summarizing the semiparmetric filtering algorithm. 


|3.3| Next, we use the model 0 to integrate this ensemble 
forward in time to find the forecast ensemble {x k ’f(t)} (also 
known as the background ensemble). In order to update the 
statistics of 0, we run a single diffusion forecast step c‘(t) = 
Ac a (t—T). Using the ensemble {x k ’* (t)} and the coefficients 
c f (f), we compute the forecast statistics (also known as the prior 
statistics or background statistics) by mixing the parametric and 
nonparametric information as follows. 


x^ (t) 

cL{t) 

8 a (t — t) 

cUt) 


e f (t) 

c/eW 

C f (t) 


j<'52 xk ' f ( t ) (13) 

k =1 

K 

~ xf ^(x k {t) - x f {t)) T 

k =1 

k =1 

^(*) T 

A (; x kJ (t) - x f (t)){O k ’ a (t - r) - 9 a (t - r)) T 

^ K — 1 

k =1 

M 

J2c j {t)^Pj(9 z )pec j(0i) 
i=i 


/ (0-^(t))(0- 

JoeM 
N 




i=l 


N 

y^9jp f (9j,t) 

i =1 

-9 f (t)) r p f (9,t) 

- 9 f (t)) T p f (9 h t ) 


/ Cxx(t) C^g(t) \ 

V <?«*(*) C/ S (i) j ' 


Notice that we use the ensemble to compute the statistics of 
a; and the correlation between a: and 6 1 , while we use the 
nonparametric model to compute the statistics of 8. Moreover, 
the cross covariance terms C xS (t) = Cg x (t) T simply use the 
analysis ensemble 9 k ’ a (t — r) from the previous time step. This is 
required since we do not integrate the individual initial conditions 
9 k,a , but rather the full density. Using the semiparametric 
forecast mean (t) = 6? (t)) and covariance we 

now resample the forecast ensemble z k ’^(t) = (x k ’f (t),8 k, f (t)), 
(we use a square root ensemble here) and for convenience we 
will use the same symbols for the resampled forecast ensemble. 
Naively, one might hope to simply apply the rejection sampling 
strategy to generate 9 k, f , and keep using the ensemble x k 'f 


from the ensemble forecast without resampling; however, this 
strategy does not preserve the cross-covariance structure between 
x and 8 from the previous filter step. We attempted this 
alternative strategy and found that it significantly degraded the 
filter performance compared to resampling the ensemble z k 'f (t) 
using ©. With the newly resampled ensemble, we form the 
observation ensemble y k ’^(t) = h(x k 'f(t), 9 k 'f(t)) by applying 
the observation function to the reformed ensemble. Notice that 
in general the observation function may depend on both the state 
* and the parameters 8, but in the examples below we will only 
consider the more difficult case where the observation function 
only depends on x (so that 8 is not directly observed). 

With the forecast step completed, we now perform the 
assimilation step, which will consist of an EnKF assimilation step 
and a secondary nonparametric assimilation step. We first perform 
the EnKF update in order to assimilate the information given by 
observations, y°(t), at time t, with the standard equations, 

, K 

y f (t) = j7^2 yk,f {t) (14) 

k=1 

At) = 

k=1 

1 K 

Czyit ) = — ^(z kJ (t) - z f {t)){y k ' f (f) - y f {t)) T 
k=1 

, K 

Cyyit) = — ^2(y kJ (t) - y f {t)){y kJ (t) - y f {t)) T 
k=1 

K{t) = Czy{t)Cyy(t) 

z a (t) = (x a (t), 8 a (t)) = z f (t) + K(t)(y°(t) - y f (t)) 

C a (t)=C f (t)-K(t)C yy (t)K(t) T . 

The ensemble Kalman update © yields the analysis mean 
(x a (t), 9 a (t)) and covariance C a (t) from the parametric model, 
and it remains to update the nonparametric model coefficients 
c-'(f) in order to find the analysis coefficients c a {t) which are 
required in order to iterate the filter. Notice that if we do not update 
these coefficients to incorporate information from y°(t), they will 
evolve to represent the equilibrium density p e q , and the forecast 
8‘ will eventually be constant, so this secondary assimilation step 
is crucial. 

In order to update c*(t), we will take the posterior estimate 
8 a (t) from the EnKF, and treat it as a noisy observation, 
which is reasonable since it is only an estimate of the true 
state, and the filter also gives us an error estimate, namely the 
analysis covariance, Cg S (t). If the observations were continuous 
in time, we could filter this observation using the Zakai equation 
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projected onto the basis {pj}, and this was the approach taken 
in |Berry and Harlim| ( |2014b| >. However, since the observation 
time may be long, we will apply a Bayesian update which was 
introduced in |Berry and Harlim] ( |2015aj ). Explicitly, we want 
to find the posterior density p a (6,t) = p(9,t\y°(s),s < t ) by 
combining the forecast density p'(9, t) = p(9, t\y°(s),s < t) = 
(t)ipj(9)p e q(9) from the nonparametric model with the 
EnKF analysis estimate 9 a (t). Since the ensemble Kalman filter 
assumes the analysis solutions to be Gaussian, we have the 
following marginal analysis density for 9, 

p{9\y°(t)) oc exp(-^\\9 - 9 a (t)\\c Se ). (15) 

We will treat this Gaussian analysis density as a likelihood 
function, that is, we define the likelihood function p(9 a (t) \ 9) = 
p(9 | y°(t)). We now use this likelihood function in the following 
Bayesian assimilation scheme, 

p a (0, t) oc p f (9, t)p(9 a (t) | 9) 

M 1 

oc W:Peq(0)exp(- -|| 9-9 a (t)\\c- e )- (16) 

i=i 

By evaluating ( | 1 6| t at each data point 0, from the training data 
set, we can find p a (9,t) up to a normalization factor which we 
compute with <(8]». We can then compute the updated coefficients 
c?(t) by projecting the analysis density p a (9, t) onto the basis ipi 
by computing. 


N 

igagl OT( (>,). <i7) 

With the updated coefficients c a (t) from © we are now ready 
to perform the next filter step. Moreover, at this point we can 
apply the semiparametric forecasting algorithm of Section[3]to the 
current state, represented by x°(f), C xx (t) and c a (t) to estimate 
the future densities p(9, t + mr \ y°(s), s <t) at future times t + 
mr, given only the information up to the current time, t. 

Initializing the mean (x“(0), 0 a (O)) and covariance C a (0), 
where time zero simply represents the first observation one desires 
to assimilate, is a standard problem in filtering and in this paper 
we will use the true initial state (x(O),0(O)) along with an 
empirically chosen diagonal covariance matrix with C xx = /40 
and Cgg = Qee- To initialize the coefficients c a (0) we will use a 
non-informative prior, meaning that the initial density p a (9, 0) = 
Peq(0) so that c“(0) = (peq, <fij) « ^ 

5.1. Semiparametric filtering and forecasting a system with 
chaotic model error 

In this example we apply the semiparametric filtering technique 
to the system © for e€{0.25,l,4} given only the model 
f(x, 9) and a time series of 5000 noisy observations y°{ti) with 
At = tj-|_i — ti =0.1. Implicitly, this means that we are applying 
the complete procedure, including estimating the variance of the 
parameters 0 as in Appendix [A] recovering a time series of the 
parameters 0 as in Section]?] building the nonparametric model for 
0, including applying diffusion maps algorithm to obtain the basis 
1 fij(9i) and the forecasting matrix A as discussed in Section 3.1 
implementing semiparametric filtering as described in Section |5 
and semiparametric forecasting as described in Section [5] Note 
that we apply the time delay embedding to the recovered time 
series 9i and use this embedded data set to build the nonparametric 
model, including the equilibrium density p e q (0j), the basis tpj ( 9i) 
and the forecasting matrix A. Finally, we apply the semiparametric 
filter to an out-of-sample collection of 1000 noisy observations 


y°(ti) where i = 5101,..., 6100. At each step, using the filter 
analysis as the initial conditions for the semiparametric forecast, 
we compute the forecast for 50 steps of length r = Af = 0.1 
(which is 25 model days using the convention of 1 model day = 
0.2 model time units) and we compare the mean of the forecast to 
the true state x at the corresponding lead times. 





Time (model days) 


Figure 5. Comparison of forecasting methods for {H with recovered training data 
and filtered initial condition for e = 0.25 (left), e = 1 (middle), and e = 4 (right). 


In Figure [5] we compare the semiparametric filter to alternative 
methods of correcting for the model error, all of which use the 
same training data set for 0,;. Notice that when the standard 
Lorenz-96 © model is used without any modification, the 
ensemble Kalman filter cannot even estimate the initial state x, 
and the forecast is worse than the climatology of the true system. 
The perfect model produces the most accurate forecasts for larger 
£ = 1,4. Notice that for e = 1, the perfect model in Figure[5]does 
not exhibit the bias as seen in the long term forecast of Figure 
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Time (time steps, At = 0.1) 


Figure 6. Demonstrating the evolution of the system m with e = 4 (top) with the corresponding time series of the latent variable 7 (middle) and the first parameter 6 1 
(bottom, black curve) along with the recovered time series of 6\ estimated with the filter (bottom, red curve). 


[2] This is due to the ensemble from the filtered state having non¬ 
diagonal covariance structure which places the ensemble closer 
to the attractor of the true system © and reduces the bias in 
the forecast. As shown in Figure [5] when e = 0.25 the system 
© is sufficiently stiff that the perfect model does not produce 
good filtered estimates and consequently the perfect model is 
outperformed by the alternative approaches. 

The semiparametric filter and forecast give the next best 
forecasting skill and this approach is robust across the different 
values of e. When the Lorenz-63 system evolves more slowly 
(e = 1,4), the semiparametric model has a significant advantage 
over all except for the perfect model. The next best approach is 
HMM, which samples randomly from the training data set (see 
|Kang and Harlim[2012| for detailed implementation of the HMM 
in the data assimilation procedure). This approach matches the 
semiparametric method when the Lorenz-63 system is very fast 
(e = 0.25), which agrees with the theory of HMM in E et al. 
( |2007| l, because when e is small the Lorenz-63 system goes to the 
equilibrium density very quickly relative to the Lorenz-96 system 
which means that forecast model A cannot provide any additional 
information to the forecast. Finally, MSM performs quite well for 
small c, but the predictive skill degrades severely as e increases 
(catastrophic divergence in the case of e = 4). The persistence 
model forecast exhibits the catastrophic divergence for all values 
of e since the model does not conserve energy. 

5.2. Semiparametric filtering and forecasting a system with 
stochastic model error 

In this example we apply the semiparametric filtering technique 
to the following system with four parameters 0 = (0i,..., Of), 
each parameter 9j is paired with the 10 state variables 27 with 
ceil(i/ 10 ) = j, 

f(Xii 9 ) = ^ C eil(i/10)— l't't+l Xi—\Xi—2 Xi + 6 

9j = 1 + (3/10) sin (7 + (tv / A)j) (18) 

7 = -(2 - sin( 27 )/ 2 )/e + \J.l/e W, 

for £€{0.25,1,4}. Although the parameter space 9 is four 
dimensional, these parameters actually lie on a one-dimensional 


domain M embedded in the ambient 4-dimensional space. In 
this example, we set this one dimensional domain to be a 
circle with latent coordinate 7 which has a state dependent drift 
and a small amount of diffusion on the circle. We chose this 
example to demonstrate the semiparametric model on a higher¬ 
dimensional parameter space 9 e R 4 with stochastic evolution. 
Since the parameters 9 are actually constrained to a one¬ 
dimensional physical domain, we expect to learn the model for 
the parameters with a very small training data set of noisy 
observations. Finally, this example will demonstrate the strength 
semiparametric approach since it is nontrivial to represent the 
hidden latent variable 7 with a parametric ensemble forecast. 
Given only the model f(x,9) and a time series of 5000 noisy 
observations y° with At = 0 . 1 , we apply the method of Appendix 
|A|to estimate the 4 x 4 covariance matrix of the parameters 9, and 
then we apply the method of Section [4] to recover the time series 
of 9. As an example, we show the recovered estimate of the time 
series of 9\ is compared to the true time series of 9\ in Figure [ 6 ] 
for e = 4. 


Even though the state 7 is fully observable from the four 
dimensional time series 9, we still apply a time delay embedding 
to the recovered training time series with a single lag to smooth 
out the remaining noisy recovered time series. Notice in Figure 
[ 6 ] that the reconstruction of the true time series of parameters 
is somewhat noisy. It was shown in Berry and Harlim (20 15a[ >, 
applying the theory of |Berry et al. i 2013| >, that the delay 
embedding can be used to reduce the influence of the noise 
on the nonparametric model. We note that while there is not a 
strong dependence on the number of lags used in the time delay 
embedding for this example, the theory introduced in Berry et al. 
( |2013| ) shows that as the number of delays increases, the geometry 
of the attractor becomes biased towards the stable components of 
the evolution, which are orthogonal to the noise. 

We learn the nonparametric model from 5000 recovered 
training data points of 9, and apply the semiparametric filter to 
an out-of-sample collection of 1000 noisy observations y°(ti) 
where i = 5101,..., 6100, computing the forecast up to the 50 
step (5 model time unit, 25 model day) lead time as in the 
previous example. In Figure [7] we see that the nonparametric 
filter significantly outperforms the perfect model filter across all 
three time scales. The poor performance of the perfect model is 
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due to the small ensemble size, with only 82 ensemble members 
unable to capture the evolution of the stochastic system 
Notice that any small perturbation of the latent variable 7 leads 
to significant variability in the high-dimensional chaotic evolution 
of x, so sufficiently sampling the uncertainty of 9 is difficult with 
a small ensemble of only 82 paths. In contrast, as previously 
noted, the semiparametric model does not produce paths of the 
evolution of 7 , but rather samples independently from the forecast 
density at each forecast step. Only when the system is very 
slow, e = 4, is there a comparable filter, and in this case simply 
using the persistence model gives reasonable estimates. However, 
while the persistence model gives a reasonably good forecast in 
the very short term for the case e = 4, the long term forecast 
quickly degrades and in this case the forecast is unstable due 
to the dynamics ( | 1 being unstable when 9 is fixed at a value 
far from (1,1,1,1). Attempting to use the standard Lorenz-96 
system ([9]) fails to filter the observations due to the model error, 
and while both the perfect model and the HMM give stable 
filters and forecasts they are significantly outperformed by the 
semiparametric model. 

6. Conclusion 

The goal of semiparametric modeling is to blend the strengths of 
parametric and nonparametric models. In particular, parametric 
models are able to use prior knowledge to describe intrinsically 
high-dimensional phenomena, however they are vulnerable to 
model error arising from unresolved phenomena or truncated 
scales. On the other hand, nonparametric models learn directly 
from the data and place very few prior assumptions on the data, 
however the data requirements grow exponentially in the intrinsic 
dimensionality of the problem. 

For example, if one assumes that the most of the low frequency 
variability of the atmospheric dynamics, such as the Rossby 
waves pattern or North Atlantic Oscillation, can be captured 
by parametric models (such as the quasi-geostrophic models) 
then intuitively, if the model error is low-dimensional, it can be 
captured by a nonparametric model. In this paper, we introduce 
a modeling framework which can blend these approaches. We 
test our approach on a simple example, in which model error is 
introduced into the Lorenz-96 model with a parameter governed 
by either the Lorenz-63 model or a one-dimensional stochastic 
model. These examples were chosen as challenging test cases for 
semiparametric modeling in the following sense: The advection 
term in © or l |18| > does not actually conserve energy due to the 
parameter 9. Of course, the average value of 9 in ( | 1 ()[ > or © 
is 1, so averaged over a large enough time window the L96-L63 
advection does approximately conserve energy. As an analogy to 
a real physical problem, treating the L96-L63 model as the truth, 
one can imagine that a coarse physical model may have suggested 
using the Lorenz-96 model, which exactly conserves energy, while 
the true system does not. The difference between these two 
systems is very subtle as shown by the spatio-temporal patterns 
in Figure]!] However, this small model error has a dramatic effect 
on forecasting skill as shown in Figure|2]and even more so on the 
filtering as shown in Figure [5] The stochastic example in ( | 1 8| > is 
an even a more difficult test bed in the sense that model errors 
are introduced by four parameters whose intrinsic dynamics is 
only one-dimensional. One can imagine that a standard parametric 
modeling approach may end up over-fitting the one-dimensional 
dynamics with four separate parametric models. 

The examples discussed above suggest that coarse physical 
models can be significantly improved by the semiparametric 
modeling paradigm. Rather than choosing arbitrary models for 
unknown phenomena, the modeler can build a parametric model 
which captures all the known phenomena, and fit the remaining 
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Figure 7. Comparison of forecasting methods for the L96-stochastic model with 
recovered training data and filtered initial condition for e = 0.25 (left), e = 1 
(middle), and e = 4 (right). 


unknown phenomena with nonparametric modeling. Of course, if 
some knowledge of the parameters is available, one should first 
extend the parametric model as far as possible using real physical 
knowledge. Only when the physical knowledge is exhausted 
should the semiparametric modeling be applied, thereby reducing 
the dimensionality of the model error as much as possible before 
learning the nonparametric model. Choosing an appropriate 
parametric form to encode all the existing physical knowledge is a 
significant challenge, and the semiparametric modeling approach 
will need to take advantage of the many advancements in this field 
to reduce the load on nonparametric model. 

The algorithms introduced in this paper strive to be practical 
by seamlessly blending the nonparametric model into a standard 
existing framework for parametric filtering and forecasting. 
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However, there are two significant practical limitations of this 
framework. First, in order to extract a time series of the parameters 
(to be used for building the nonparametric model), one requires 
a minimum observability condition for the parameters, as in any 
standard inverse problem. Assuming that this condition is met, 
we foresee that improving the methodology for extracting the 
time series of the hidden parameters from noisy data is the key 
remaining challenge to lifting this framework to real applications. 
Indeed, the approach introduced in Section [4] and Appendix |A| is 
intended only as a proof-of-concept which shows that extracting 
an unobserved time series of parameters is possible. Second, 
the evolution of the parameters 6 = g(6 , W) is assumed to be 
independent of the state variable x of the parametric model. This 
assumption is currently required because the nonparametric model 
of |Berry et al. |j2015| ) assumes that the evolution represented by 
the times series is ergodic, and if g were allowed to depend on 
x the time series 0 would not be ergodic by itself. Of course, 
combining the state variables (x, 6) would typically yield an 
ergodic time series, however, this system would once again be 
high-dimensional and would not be accessible to a nonparametric 
model. Similarly, if g depended on x, the theory of Takens 
|Takens| ( |1981| >; |Sauer et al. \ ( |1991b| ) suggests that a time-delay 
embedding of 6 would attempt to reconstruct the attractor of the 
full system (x, 6) which again would again be limited by the curse- 
of-dimensionality. The most probable avenue of overcoming this 
restriction is to note that the condition probability of p{Q,t\x(t)) 
has an evolution independent of x. We expect that extending the 
theory of |Berry et fl/.|j20l5] ) to conditional densities of this form 
will be the key to constructing semiparametric models where the 
evolution of the parameters is state dependent. 
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A. Estimating the variance of unobserved and umodeled 
parameters 

In this appendix we consider the filtering problem, 

x = f{x,9) 

9 = g(9,W)*^QggW (19) 

y = h(x) + r\ 


the covariance matrix Q = 


of the full system 


where we have access to the noisy observations y at discrete 
times L, and the noise y is normal with variance R. We assume 
that the model / and the observation function h are known, but 
that the model g is unknown and we will instead use a white 
noise model g(9, W) = \/Qgg W. We will apply the method of 
( Berry and Sauer ^0l3] l to find the variance Qgg of the unmodeled 
parameters 0 in ( |19| ). The method of ( |Berry and Sauer|2013| l finds 

Qxx Qx6 
Qqx Qee 

( x,9 ). Thus the technique developed here will allow stochastic 
forcing with covariance Q X x on the x variable, although none of 
the examples in this paper have this feature. 

The method of ( |Berry and Sauerj |2013| > requires us to 
parameterize the matrix Q = X^=i qrQr where each Q r is a 
fixed matrix of the same size as Q and our goal is to find the 
parameters q r . The key to finding the parameter Qgg for the 
problem ( | 19[ > is to parameterize the cross covariance Qg x between 
the observed variables x and the unmodeled unobserved variables 
9. Letting x be n dimensional and letting 9 be m dimensional, 
for each entry ( Qg x )ij we introduce a parameter qi n +j which 
corresponds to a fixed matrix Qi n +j which has a 1 in the th 
position of Qg x and in the ( j , i)-th position of Q x g and is zero 
otherwise. Notice, that whenever the matrix Q = 9rQr is 

formed, the sub-matrix Qgg will be identically zero since there 
are no parameter matrices with nonzero entries in Qgg. Thus, 
whenever we form the matrix Q = QrQr we must project 

this matrix onto the nearest symmetric positive definite covariance 
matrix. We do this by finding the singular value decomposition 
Q = UAV t and then replacing Q with Q = UAU T . With this 
choice of parameterization of Q, along with the method of 


projecting on Q, we can apply the method of (Berry and Sauer 


|2013| to find the parameters q r from an adaptive EnKF. 
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